Visualizations

Lennart Kasserra

CorrelAid

2024-04-22

Setup

library(dplyr)
library(ggplot2)

Using data from Gapminder this time, which comes as an R-package:

library(gapminder)
data("gapminder")

gm <- gapminder |> filter(year == 2007) # just use one year as subset for now

Visualizations with ggplot2:

  • gg = “Grammar of graphics

A heuristic for thinking about statistical visualizations. What is a visualization?

A mapping of data to aesthetic attributes (color, shape, size…) of geometric objects (points, lines, bars…)

How it translates to code: layers

We work in layers: just calling ggplot creates an empty base layer:

ggplot()

How it translates to code: data & mappings

Adding data & mappings for the axes:

ggplot(data = gm, mapping = aes(x = gdpPercap, y = lifeExp))

How it translates to code: geom_*s

Adding “geometric objects” (geom_*()s) to represent data points (here: points):

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp)) + # adding layers with "+"
  geom_point()

How it translates to code: other aes()

There are more attributes, like color:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point()

Exercise

  • Exercise: Recreate the above plot, and map size to the countries’ population (pop)! Bonus: Do you notice anything that could be problematic?

How it translates to code: static aesthetics

If we set aesthetic attributes outside aes(), they take on static values. For example, alpha determines the opaqueness of objects:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + 
  geom_point(alpha = 0.5)

Axes and labels

Let’s work on those labels; currently, legend labels for pop and axis labels for gdpPercap look… unfortunate. You can fix this using scale_*-layers and the scales-package:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + 
  geom_point(alpha = 0.5) +
  scale_size_continuous(labels = scales::label_comma()) +
  scale_x_continuous(labels = scales::label_currency())

Axes and labels

We could also try projecting GDP on a logarithmic scale given its distribution (change normal _continuous scale to _log10:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + 
  geom_point(alpha = 0.5) +
  scale_size_continuous(labels = scales::label_comma()) +
  scale_x_log10(labels = scales::label_currency())

Axes and labels: labs()

You can also set labels for axis & legend using labs():

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + 
  geom_point(alpha = 0.5) +
  scale_size_continuous(labels = scales::label_comma()) +
  scale_x_continuous(labels = scales::label_currency()) +
  labs(
    title = "GDP & Life Expectancy", 
    subtitle = "...for countries from 5 continents",
    x = "GDP p.C.", 
    y = "Life Expectancy (Years)",
    color = "Continent", 
    size = "Population"
  )

Themes

ggplot2 also has themes:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + 
  geom_point(alpha = 0.5) +
  scale_size_continuous(labels = scales::label_comma()) +
  scale_x_continuous(labels = scales::label_currency()) +
  labs(
    title = "GDP & Life Expectancy", 
    subtitle = "...for countries from 5 continents",
    x = "GDP p.C.", 
    y = "Life Expectancy (Years)",
    color = "Continent", 
    size = "Population"
  ) +
  theme_minimal()

Other geom_*s

Lines (geom_line()):

gapminder |> 
  filter(country == "Germany") |> 
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()

Other geom_*s

Bars (geom_bar() & geom_col() (= columns, aka easier to use bars)):

gapminder |> 
  filter(country == "Germany") |> 
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_col()

Other geom_*s

Violin- & Boxplots (geom_violin() & geom_boxplot())

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = year)) +
  geom_boxplot()

Combining geoms

You can also just slap multiple geoms on top of each other (they will be drawn in the order they are specified in the code):

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = year)) +
  geom_violin(alpha = 0.5) +
  geom_boxplot(fill = "grey") +
  geom_point(position = position_jitter(width = 0.3), alpha = 0.2)

Combining geoms

Let’s add a trend line to our original plot. This may get complicated for more complex plots; be mindful of which aes() belong to which geom:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_line(stat = "smooth", method = stats::loess, linetype = "dashed", color = "grey", size = 1) +
  # ^ Alternatively, look into geom_smooth (can e.g. also display confidence interval)
  geom_point(aes(color = continent, size = pop), alpha = 0.5) +
  scale_size_continuous(labels = scales::label_comma()) +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(
    title = "GDP & Life Expectancy", 
    subtitle = "...for countries from 5 continents",
    x = "GDP p.C.", 
    y = "Life Expectancy (Years)",
    color = "Continent", 
    size = "Population"
  ) +
  theme_minimal()

Combining geoms

Or linear & by continent:

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_line(stat = "smooth", method = stats::lm, linetype = "dashed", size = 1, alpha = 0.3) +
  geom_point(aes(size = pop), alpha = 0.5) +
  scale_size_continuous(labels = scales::label_comma()) +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(
    title = "GDP & Life Expectancy", 
    subtitle = "...for countries from 5 continents",
    x = "GDP p.C.", 
    y = "Life Expectancy (Years)",
    color = "Continent", 
    size = "Population"
  ) +
  theme_minimal()

Wraps

Let’s say you wanted to plot the distribution of life expectancy every year by continent. Would you have to filter() and make plots for all continents separately? There is of course a smarter way: we can “wrap” the plot by continent:

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = year, fill = continent)) +
  geom_boxplot() +
  theme_minimal() +
  facet_wrap(~continent)

Axes, scales & theme()-elements

All visual elements of the plot can be changed inside theme() (the ggplot2-internal syntax may seem a bit strange at first):

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = year, fill = continent)) +
  geom_boxplot() +
  facet_wrap(~continent) +
  theme(
    axis.text.x = element_text(angle = 45), 
    legend.position = "bottom"
  )

Color & fill scales

gm |> 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + 
  geom_point(alpha = 0.7) +
  labs(
    title = "GDP & Life Expectancy", 
    subtitle = "...for countries from 5 continents",
    x = "GDP p.C.", 
    y = "Life Expectancy (Years)",
    color = "Continent", 
    size = "Population"
  ) +
  theme_minimal() +
  scale_color_viridis_d()

  # Or others:
  #scale_color_brewer(palette = "Accent")

If you are interested in visualizations, you can read up on color & fill scales and color theory for ggplot2 here.

Exercise time

Load the “Palmer’s Penguins” data set. Make visualizations to explore the data. For example, look at bill_length_mm and bill_depth_mm by species or body_mass_g by species and sex, or any other relationship!

# install.packages("palmerpenguins")
library(palmerpenguins)

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Visualizing results

Here we will look at how to visualize model results. Remember our model from last time:

bigfoot <- readr::read_csv(here::here("data/bigfoot.csv"))

bf_model <- 
  bigfoot |> 
  mutate(
    # Temperature to celsius:
    temp_celsius = (temperature_mid - 32) / 1.8,
    # Construct outcome:
    reliable_sighting = if_else(classification == "Class A", 1, 0)
  ) |> 
  lm(
    reliable_sighting ~ temp_celsius + humidity + moon_phase + cloud_cover + uv_index,
    data = _
  )

Visualizing results

The first step is to get the model object into a proper data frame we can feed into ggplot:

bf_model |> 
  broom::tidy()
## # A tibble: 6 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   0.551     0.0615      8.96  5.48e-19
## 2 temp_celsius  0.00361   0.00140     2.58  9.85e- 3
## 3 humidity     -0.157     0.0786     -1.99  4.64e- 2
## 4 moon_phase    0.0123    0.0322      0.382 7.03e- 1
## 5 cloud_cover   0.0980    0.0377      2.60  9.48e- 3
## 6 uv_index     -0.00911   0.00507    -1.80  7.27e- 2

Visualizing results

We now use mutate to add upper & lower bounds of the confidence interval:

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  )
## # A tibble: 6 × 7
##   term         estimate std.error statistic  p.value    upper    lower
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1 (Intercept)   0.551     0.0615      8.96  5.48e-19  0.613    0.490  
## 2 temp_celsius  0.00361   0.00140     2.58  9.85e- 3  0.00501  0.00221
## 3 humidity     -0.157     0.0786     -1.99  4.64e- 2 -0.0780  -0.235  
## 4 moon_phase    0.0123    0.0322      0.382 7.03e- 1  0.0446  -0.0199 
## 5 cloud_cover   0.0980    0.0377      2.60  9.48e- 3  0.136    0.0602 
## 6 uv_index     -0.00911   0.00507    -1.80  7.27e- 2 -0.00403 -0.0142

Visualizing results

We also don’t need the y-intercept:

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  ) |> 
  filter(term != "(Intercept)")
## # A tibble: 5 × 7
##   term         estimate std.error statistic p.value    upper    lower
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 temp_celsius  0.00361   0.00140     2.58  0.00985  0.00501  0.00221
## 2 humidity     -0.157     0.0786     -1.99  0.0464  -0.0780  -0.235  
## 3 moon_phase    0.0123    0.0322      0.382 0.703    0.0446  -0.0199 
## 4 cloud_cover   0.0980    0.0377      2.60  0.00948  0.136    0.0602 
## 5 uv_index     -0.00911   0.00507    -1.80  0.0727  -0.00403 -0.0142

Visualizing results

Make the actual plot:

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  ) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = term, y = estimate)) +
  geom_pointrange(aes(ymin = lower, ymax = upper))

Visualizing results

Add a line to show where 0 is:

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  ) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = term, y = estimate)) +
  geom_hline(yintercept = 0, lty = "dashed", color = "darkgrey") +
  geom_pointrange(aes(ymin = lower, ymax = upper))

Visualizing results

Reorder covariates by effect strength (read: reorder term by estimate):

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  ) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = forcats::fct_reorder(term, estimate), y = estimate)) +
  geom_hline(yintercept = 0, lty = "dashed", color = "darkgrey") +
  geom_pointrange(aes(ymin = lower, ymax = upper))

Visualizing results

You could also flip the axis using coord_flip():

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  ) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = forcats::fct_reorder(term, estimate), y = estimate)) +
  geom_hline(yintercept = 0, lty = "dashed", color = "darkgrey") +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  coord_flip()

Visualizing results

Cosmetics:

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error
  ) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = forcats::fct_reorder(term, estimate), y = estimate)) +
  geom_hline(yintercept = 0, lty = "dashed", color = "darkgrey") +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  coord_flip() +
  theme_minimal() +
  labs(
    x = "", 
    y = "Estimate", 
    title = "Predictors of reliable bigfoot sightings"
  )

Visualizing results

Bonus: Add stars for significance:

bf_model |> 
  broom::tidy() |> 
  mutate(
    upper = estimate + std.error,
    lower = estimate - std.error,
    sig = case_when(
      p.value < 0.01 ~ "***", 
      p.value < 0.05 ~ "**", 
      p.value < 0.1 ~ "*",
      TRUE ~ ""
    )
  ) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = forcats::fct_reorder(term, estimate), y = estimate)) +
  geom_hline(yintercept = 0, lty = "dashed", color = "darkgrey") +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  geom_text(aes(label = sig), nudge_x = 0.1) +
  coord_flip() +
  theme_minimal() +
  labs(
    x = "", 
    y = "Estimate", 
    title = "Predictors of reliable bigfoot sightings"
  )

Visualizing results

Tips:

  • You can also use geom_errorbar()
  • You could modify the term column using stringr to have nicer names (e.g. “Cloud Cover” instead of “cloud_cover”)

Final exercise

Load the data on Himalaya-expeditions (click here for more details) again:

members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv')
expeditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv')
peaks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/peaks.csv')
  • Merge & clean the data (if neccessary)
  • Explore the data
  • Make visualizations
  • Fit a model (e.g. predict whether an expedition member dies)
  • Visualize & communicate your results